PSCI 8357 - STAT II
Department of Political Science, Vanderbilt University
February 17, 2026
Randomization of the treatment will make the treatment and control groups similar on average with respect to observed and unobserved covariates
Advantage 1: Identification is justified by design of experiments
Advantage 2: Estimation is simple
Advantage 3: Inference is simple
Neyman Approach
Limitation 1: Asymptotic approximation is required for inference
\(\rightsquigarrow\) Inference is not reliable with small sample size
Limitation 2: Variance can be complicated for complex experimental designs.
Fisherian Approach
Gerber, Green, and Larimer (2008) test intrinsic motives in a large-scale field experiment, applying varying degrees of extrinsic pressure on voters through a series of mailings to 180,002 households before the August 2006 primary election in Michigan.
| | Control (Not Mailed) | Civic Duty (Encouraged to Vote) | Hawthorne (Encouraged & Monitored) | Self (Encouraged, Monitored, Shown Own Past Voting) | Neighbors (Encouraged, Monitored, Shown Own & Others’ Past Voting) |
|---|---|---|---|---|---|
| Percent Voting | \(29.7\%\) | \(31.5\%\) | \(32.2\%\) | \(34.5\%\) | \(37.8\%\) |
| \(N\) of Individuals | \(191,243\) | \(38,218\) | \(38,204\) | \(38,218\) | \(38,201\) |
Units: \(i \in \{1, \ldots, N\}\)
Treatment: \(T_i \in \{0, 1\}\), randomly assigned.
Potential outcomes: \(Y_i(0)\) and \(Y_i(1)\).
Observed outcome: \(Y_i = T_i Y_i(1) + (1-T_i) Y_i(0)\) (consistency).
Treatment Assignment Mechanism:
\[ \{Y_i(1), Y_i(0)\} \ {\mbox{$\perp\!\!\!\perp$}}\ T_i \]
Causal Estimand: Still ATE.
\[ \tau_{ATE} \equiv {\mathbb{E}}\{Y_i(1) - Y_i(0)\} \]
Still not directly estimable as we don’t observe \(Y_i(1) - Y_i(0)\) for each unit
RESULT: Identification under Randomization
\[ \begin{aligned} {\mathbb{E}}\{Y_i(1) - Y_i(0)\} &= {\mathbb{E}}\{Y_i(1)\} - {\mathbb{E}}\{Y_i(0)\} \quad \text{($\because$ linearity of ${\mathbb{E}}$)} \\ &= {\mathbb{E}}\{Y_i(1) {\:\vert\:}T_i = 1\} - {\mathbb{E}}\{Y_i(0) {\:\vert\:}T_i = 0\} \quad \text{($\because$ randomization of $T_i$)} \\ &= {\mathbb{E}}[Y_i {\:\vert\:}T_i = 1] - {\mathbb{E}}[Y_i {\:\vert\:}T_i = 0] \quad \text{($\because$ consistency of PO)} \end{aligned} \]
Difference-in-Means (DiM) estimator:
\[ \widehat{\tau}_{DiM} \equiv \frac{1}{N_1} \sum_{i=1}^N T_i Y_i - \frac{1}{N_0} \sum_{i=1}^N (1 - T_i) Y_i \]
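As a quick sanity check of this estimator, a minimal simulation (hypothetical data, not the study above; all numbers are made up): under complete randomization the difference-in-means recovers the true ATE.

```r
# Minimal simulation: DiM under complete randomization recovers the ATE.
set.seed(123)
N  <- 100000
Y0 <- rnorm(N)                       # fixed potential outcomes under control
Y1 <- Y0 + 2                         # potential outcomes under treatment; true ATE = 2
T  <- sample(rep(c(0, 1), N / 2))    # complete randomization: exactly N/2 treated
Y  <- T * Y1 + (1 - T) * Y0          # consistency: only one PO observed per unit
tau_hat <- mean(Y[T == 1]) - mean(Y[T == 0])
tau_hat  # close to 2
```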
Without randomization, in general \[ {\mathbb{E}}\{Y_i(1)\} \neq {\mathbb{E}}\{Y_i(1) {\:\vert\:}T_i = 1\}, \quad {\mathbb{E}}\{Y_i(0)\} \neq {\mathbb{E}}\{Y_i(0) {\:\vert\:}T_i = 0\} \]
That is, without randomization, treatment and control groups differ with respect to pre-treatment covariates.
Pre-treatment covariates: Variables that are not affected by the treatment.
Importantly, potential outcomes are pre-treatment covariates!
Note: observed outcomes are post-treatment variables!
Consider a finite population and focus on design-based inference:
Treatment variables \((T_1, \ldots, T_N)\) are random.
Units and potential outcomes (\(Y_i(1), Y_i(0)\)) are fixed.
We now distinguish Sample Average Treatment Effect (SATE):
\[ \tau_{SATE} \equiv \frac{1}{N} \sum_{i=1}^N \{Y_i(1) - Y_i(0)\} \]
Randomization is the “reasoned basis for inference” (Fisher 1935)
Design-based inference:
Unbiased for the SATE under complete randomization; no modeling assumption is needed.
First, condition on the set of potential outcomes \(\mathcal{O}_N = \{Y_i(1), Y_i(0)\}_{i=1}^N\); then:
\[ \begin{aligned} {\mathbb{E}}[\widehat{\tau}_{DiM} {\:\vert\:}\mathcal{O}_N] &= \frac{1}{N_1} \sum_{i=1}^N {\mathbb{E}}[T_i Y_i {\:\vert\:}\mathcal{O}_N] - \frac{1}{N_0} \sum_{i=1}^N {\mathbb{E}}[(1 - T_i) Y_i {\:\vert\:}\mathcal{O}_N] \quad \text{($\because$ linearity of ${\mathbb{E}}$)} \\ &= \frac{1}{N_1} \sum_{i=1}^N {\mathbb{E}}[T_i Y_i(1) {\:\vert\:}\mathcal{O}_N] - \frac{1}{N_0} \sum_{i=1}^N {\mathbb{E}}[(1 - T_i) Y_i(0) {\:\vert\:}\mathcal{O}_N] \quad \text{($\because$ consistency of PO)} \\ &= \frac{1}{N_1} \sum_{i=1}^N {\mathbb{E}}[T_i {\:\vert\:}\mathcal{O}_N] Y_i(1) - \frac{1}{N_0} \sum_{i=1}^N {\mathbb{E}}[1 - T_i {\:\vert\:}\mathcal{O}_N] Y_i(0) \quad \text{($\because$ POs are fixed)} \\ &= \frac{1}{N} \sum_{i=1}^N Y_i(1) - \frac{1}{N} \sum_{i=1}^N Y_i(0) \quad \text{($\because$ complete randomization)} \end{aligned} \]
Inverse Probability Weighting Estimator (Horvitz–Thompson estimator) for the SATE:
\[ \widehat{\tau}_{IPW} \equiv \frac{1}{N} \sum_{i=1}^N \left\{\frac{T_iY_i}{p_i} - \frac{(1 - T_i) Y_i}{(1 - p_i)}\right\}, \]
where \(p_i = {\textrm{Pr}}(T_i = 1 {\:\vert\:}\mathcal{O}_N)\).
This estimator is more general: it remains unbiased even when assignment probabilities \(p_i\) differ across units.
\[ {\mathbb{V}}[\overline{Y}] = \frac{S^2}{n} \underbrace{\frac{N - n}{N}}_{FPC} \]
Intuition: The factor \(\frac{N-n}{N} = 1 - \frac{n}{N}\) shrinks the variance because we sample without replacement from a finite population: as \(n\) approaches \(N\), the sample mean approaches the fixed population mean, and the variance goes to zero.
\[ \begin{align*} {\mathbb{V}}[\widehat{\tau}_{DiM} {\:\vert\:}\mathcal{O}_N] &= {\mathbb{V}}[\overline{Y}_1 {\:\vert\:}\mathcal{O}_N] + {\mathbb{V}}[\overline{Y}_0 {\:\vert\:}\mathcal{O}_N] - 2 {\mathrm{cov}}(\overline{Y}_1, \overline{Y}_0 {\:\vert\:}\mathcal{O}_N) \quad \text{($\because$ variance of difference)} \\ &= \frac{S_1^2}{N_1} \frac{N - N_1}{N} + \frac{S_0^2}{N_0} \frac{N - N_0}{N} + 2 \frac{\textcolor{#d65d0e}{S_{10}}}{N} \quad \text{($\because$ \textit{FPC};}\; {\mathrm{cov}}(\overline{Y}_1, \overline{Y}_0) = -S_{10}/N \href{#proof-cov-neg}{\text{(proof)}}\text{)} \\ &= \frac{S_1^2}{N_1} + \frac{S_0^2}{N_0} - \frac{S_1^2 + S_0^2 -2 \textcolor{#d65d0e}{S_{10}}}{N} \quad \text{($\because$ rearranging terms)} \\ &= \frac{S_1^2}{N_1} + \frac{S_0^2}{N_0} - \frac{\textcolor{#d65d0e}{S_{\tau}^2}}{N} \quad \text{($\because$ variance of difference)} \end{align*} \]
RESULT: Conservative Variance Estimator
\[ \widehat{\sigma}^2= \frac{1}{N_1} \widehat{S}_{1}^2 + \frac{1}{N_0} \widehat{S}_{0}^2 \] where \(\widehat{S}_{t}^2 = \frac{1}{N_t - 1} \sum_{i=1}^N \mathbb{1} [T_i = t] (Y_i - \overline{Y}_t)^2\), and in turn \(\overline{Y}_t = \frac{1}{N_t} \sum_{j=1}^N \mathbb{1} [T_j = t] Y_j\).
So far, we have assumed for simplicity that our data represent the entire population.
In reality, we often treat our experiment as a sample from the population.
Sampling introduces an additional layer of uncertainty in causal inference:
\(\rightsquigarrow\) Focus on Population ATE.
How does this affect our inference, in terms of:
Assumption: simple random sampling from a super-population
Population Average Treatment Effect (PATE) is:
\[ \tau_{PATE} \equiv {\mathbb{E}}[Y_i(1) - Y_i(0)] \]
DiM is unbiased (over repeated sampling and treatment assignment):
\[ {\mathbb{E}}[\widehat{\tau}_{DiM}] = {\mathbb{E}}\left[ {\mathbb{E}}[\widehat{\tau}_{DiM} {\:\vert\:}\mathcal{O}_N] \right] = {\mathbb{E}}[\tau_{SATE}] = {\mathbb{E}}[Y_i(1) - Y_i(0)] = \tau_{PATE} \]
Important: Often obtaining such a sample is impossible \(\rightsquigarrow\) External Validity
Now let’s characterize the total uncertainty (sampling + design) of the DiM estimator, \({\mathbb{V}}[\widehat{\tau}_{DiM}]\).
Law of Total Variance: \(\underbrace{{\mathbb{V}}(Y)}_{\text{total variance}} = \underbrace{{\mathbb{E}}[{\mathbb{V}}(Y\mid X)]}_{\text{(mean of) "within" variance}} + \underbrace{{\mathbb{V}}({\mathbb{E}}[Y\mid X])}_{\text{"between" variance}}\) (see Angrist and Pischke 2009, Ch. 3).
Applying the LTV and denoting the population variances of \(Y_i(t)\) and \(\tau_i\) by \(\sigma_{t}^2\) and \(\sigma_{\tau}^2\):
\[ \begin{align*} {\mathbb{V}}[\widehat{\tau}_{DiM}] &= {\mathbb{E}}\left[{\mathbb{V}}[\widehat{\tau}_{DiM} {\:\vert\:}\mathcal{O}_N]\right] + {\mathbb{V}}\left[{\mathbb{E}}[\widehat{\tau}_{DiM}\mid \mathcal{O}_N]\right] \\ &= {\mathbb{E}}\left[\frac{S_{1}^2}{N_1} + \frac{S_{0}^2}{N_0} - \frac{S_{\tau}^2}{N}\right] + {\mathbb{V}}[\tau_{SATE}] \\ &= {\mathbb{E}}\left[\frac{S_{1}^2}{N_1} + \frac{S_{0}^2}{N_0} - \frac{S_{\tau}^2}{N}\right] + {\mathbb{V}}\left[\frac{1}{N}\sum_{i \in \mathcal{O}_N}\tau_i\right] \\ &= {\mathbb{E}}\left[\frac{S_{1}^2}{N_1} + \frac{S_{0}^2}{N_0} - \frac{S_{\tau}^2}{N}\right] + \frac{\sigma_{\tau}^2}{N} \\ &= \frac{\sigma_{1}^2}{N_1} + \frac{\sigma_{0}^2}{N_0} \quad \text{($\because$ ${\mathbb{E}}[S_t^2] = \sigma_t^2$ and ${\mathbb{E}}[S_{\tau}^2] = \sigma_{\tau}^2$ under simple random sampling)} \end{align*} \]
This is in contrast to the result for the SATE:
\[ {\mathbb{E}}[\widehat{\sigma}^2 {\:\vert\:}\mathcal{O}_N] \geq {\mathbb{V}}[\widehat{\tau}_{DiM} {\:\vert\:}\mathcal{O}_N] \]
Intuition: For the SATE, this variance estimator is (weakly) too large: we overestimate the variability.
But for PATE, because we have additional uncertainty, this becomes an unbiased estimator.
“Weak” null hypothesis (for PATE) (Neyman):
\[ H_0^{\text{weak}}: {\mathbb{E}}[Y_i(1) - Y_i(0)] = 0 \quad \text{vs.} \quad H_a^{\text{weak}}: {\mathbb{E}}[Y_i(1) - Y_i(0)] \neq 0 \quad (\text{two-sided}) \]
PATE
\[ \frac{\widehat{\tau}_{DiM}- \tau_{PATE}}{\sqrt{\sigma^2_1/N_1 + \sigma_0^2/N_0}} \xrightarrow{d} \mathcal{N}(0, 1) \]
SATE
\[ \frac{\widehat{\tau}_{DiM}- \tau_{SATE}}{\sqrt{S^2_1/N_1 + S_0^2/N_0 - S^2_{\tau}/N}} \xrightarrow{d} \mathcal{N}(0, 1) \]
For a binary treatment (\(T_i \in \{0,1\}\)) we can show:
\[ \widehat\beta_{OLS} \ \equiv\ \frac{\widehat{{\mathrm{cov}}}(Y_i, T_i)}{\widehat{{\mathbb{V}}}(T_i)} \ =\ \widehat{\tau}_{DiM} \]
\[ \widehat\sigma^2_{HC2} \ = \frac{1}{N_1} \widehat{S}_1^2 + \frac{1}{N_0} \widehat{S}_0^2 \]
In practice, in a completely randomized experiment, we can simply run OLS with HC2 robust standard errors (e.g., `estimatr::lm_robust()`).

# load packages
pacman::p_load(
tidyverse,
labelled,
haven,
estimatr,
sandwich
)
# load data
gerber <- haven::read_dta("../_data/gerber.dta")
# check how treatment and outcome are coded
# labelled::get_value_labels(gerber$treatment)
# labelled::get_value_labels(gerber$voted)
# calculate difference-in-means by hand
est_dim <-
gerber |>
(
\(.)
c(
hawthorne = mean(.$voted[.$treatment == 1]) -
mean(.$voted[.$treatment == 0]),
civic = mean(.$voted[.$treatment == 2]) -
mean(.$voted[.$treatment == 0]),
neighbor = mean(.$voted[.$treatment == 3]) -
mean(.$voted[.$treatment == 0]),
self = mean(.$voted[.$treatment == 4]) -
mean(.$voted[.$treatment == 0])
)
)()
# calculate difference-in-means using regression
est_lm <-
estimatr::lm_robust(
voted ~ factor(treatment),
data = gerber
) |>
estimatr::tidy() |>
dplyr::pull(estimate)
bind_cols(
treatment = names(est_dim),
est_dim = unname(est_dim),
est_lm = est_lm[-1]
) |>
knitr::kable(digits = 3) |>
kableExtra::kable_minimal(font_size = 20)

| treatment | est_dim | est_lm |
|---|---|---|
| hawthorne | 0.026 | 0.026 |
| civic | 0.018 | 0.018 |
| neighbor | 0.081 | 0.081 |
| self | 0.049 | 0.049 |
# calculate standard errors by hand
s_2 <-
gerber |>
(
\(.)
c(
control = var(.$voted[.$treatment == 0]) /
sum(.$treatment == 0),
hawthorne = var(.$voted[.$treatment == 1]) /
sum(.$treatment == 1),
civic = var(.$voted[.$treatment == 2]) /
sum(.$treatment == 2),
neighbor = var(.$voted[.$treatment == 3]) /
sum(.$treatment == 3),
self = var(.$voted[.$treatment == 4]) /
sum(.$treatment == 4)
)
)()
se_hand <- sapply(2:5, function(i) {
sqrt(s_2[i] + s_2["control"])
})
# can calculate by hand
se_sandwich <-
lm(voted ~ factor(treatment), data = gerber) |>
vcovHC(type = "HC2") |>
diag() |>
sqrt()
# can get it directly now
se_robust <-
estimatr::lm_robust(
voted ~ factor(treatment),
data = gerber,
se_type = "HC2"
) |>
estimatr::tidy() |>
dplyr::pull(std.error)
bind_cols(
treatment = names(s_2[-1]),
se_hand = se_hand,
se_sandwich = se_sandwich[-1],
se_robust = se_robust[-1],
) |>
knitr::kable(digits = 5) |>
kableExtra::kable_minimal(font_size = 20)

| treatment | se_hand | se_sandwich | se_robust |
|---|---|---|---|
| hawthorne | 0.00261 | 0.00261 | 0.00261 |
| civic | 0.00259 | 0.00259 | 0.00259 |
| neighbor | 0.00269 | 0.00269 | 0.00269 |
| self | 0.00265 | 0.00265 | 0.00265 |
Lady Tasting Tea Experiment
Units: 8 identical cups
Randomization: Randomly choose 4 cups into which tea is poured first; milk is poured first into the other 4.
Null hypothesis: the lady cannot tell the difference
Statistic: the number of correctly classified cups
Outcome: The lady classified all 8 cups correctly!
\(\binom{8}{4}= \frac{8!}{4!(8-4)!} = 70\) equally likely ways to choose the four “tea-first” cups, each yielding some number of correct classifications
only \(1\) corresponds to guessing all cups correctly!
Under the \(H_0\) (lady guessing at random), the probability that the lady classifies all cups correctly is \(1/70 \approx 0.014\).
\(p = {\textrm{Pr}}(\text{guessing all cups correctly}\) \({\:\vert\:}\text{guessing at random}) = 0.014\) \(\rightarrow\) Reject the null of guessing at random!
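The exact calculation can be reproduced by brute-force enumeration (a sketch; the cup labels below are arbitrary):

```r
# Enumerate all choose(8, 4) = 70 ways to guess the four "tea-first" cups
# and compute the exact p-value of a perfect classification under random guessing.
truth       <- c(1, 1, 1, 1, 0, 0, 0, 0)   # 1 = tea poured first
assignments <- combn(8, 4)                  # each column: one possible guess
n_correct   <- apply(assignments, 2, function(g) {
  guess <- as.integer(seq_len(8) %in% g)
  sum(guess == truth)                       # cups classified correctly
})
p_exact <- mean(n_correct == 8)             # exactly 1 of 70 guesses is perfect
p_exact  # 1/70, approximately 0.014
```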
Fisher’s sharp null hypothesis: \(Y_i(1) = Y_i(0)\) for all units.
Key idea: Under the sharp null, we “observe” all potential outcomes!
We can compute the exact \(p\)-value to test this sharp null hypothesis.
| Voters \(i\) | Contact \(T_i\) | Turnout \(Y_i\) | Potential Turnout \(Y_i(1)\) | Potential Turnout \(Y_i(0)\) |
|---|---|---|---|---|
| 1 | 1 | 1 | 1 | ? |
| 2 | 0 | 0 | ? | 0 |
| 3 | 1 | 1 | 1 | ? |
| 4 | 1 | 0 | 0 | ? |
| 5 | 0 | 1 | ? | 1 |
Estimate: \(\widehat{\tau} = \frac{2}{3} - \frac{1}{2} = \frac{1}{6}\)
Is this statistically significant? How do we compute \(p\)-value?
| Voters \(i\) | Turnout \(Y_i\) | Contact \(T_i\) | \(\widetilde{T}^1_i\) | \(\widetilde{T}^2_i\) | \(\widetilde{T}^3_i\) | \(\ldots\) |
|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 1 | 1 | 1 | \(\ldots\) |
| 2 | 0 | 0 | 1 | 1 | 0 | \(\ldots\) |
| 3 | 1 | 1 | 1 | 0 | 1 | \(\ldots\) |
| 4 | 0 | 1 | 0 | 1 | 0 | \(\ldots\) |
| 5 | 1 | 0 | 0 | 0 | 1 | \(\ldots\) |
| \(\widehat{\tau}\) | | \(\frac{1}{6}\) | \(\frac{1}{6}\) | \(-\frac{2}{3}\) | \(1\) | \(\ldots\) |
The null (\(\approx\) sampling) distribution of the test statistic is \(\{\widehat{\tau}_k\}_{k=1}^K\), where \[ \widehat{\tau}_k = \frac{\sum_{i=1}^N \widetilde{T}^k_i Y_i}{\sum_{i=1}^N \widetilde{T}^k_i} - \frac{\sum_{i=1}^N (1-\widetilde{T}^k_i) Y_i}{\sum_{i=1}^N (1-\widetilde{T}^k_i)} \]
Exact (two-sided) \(p\)-value is \(p = \frac{1}{K} \sum_{k=1}^K \mathbb{1} \left[ |\widehat{\tau}_k| \geq |\widehat{\tau}|\right]\), where \(\widehat{\tau}\) is the observed test statistic
# load data
gerber <- haven::read_dta("../_data/gerber.dta")
# observed test statistics
lm_obs <- lm(voted ~ factor(treatment), data = gerber)
obs_dim <- coef(lm_obs)[2:5]
# Fisher’s exact test
sim_dim <-
pbapply::pbreplicate(1000, {
sim_treatment <-
sample(gerber$treatment,
size = length(gerber$treatment), replace = FALSE
)
lm_sim <- lm(gerber$voted ~ factor(sim_treatment))
coef(lm_sim)[2:5]
}, cl = 8)
# p-values
mean(abs(sim_dim[1,]) > abs(obs_dim[1])) # two-sided for Hawthorne
mean(abs(sim_dim[2,]) > abs(obs_dim[2])) # two-sided for Civic
mean(abs(sim_dim[3,]) > abs(obs_dim[3])) # two-sided for Neighbors
mean(abs(sim_dim[4,]) > abs(obs_dim[4])) # two-sided for Self

Difference-in-Means (or an estimator of the ATE):
Difference-in-Mean-Ranks (for continuous outcomes), where \(R_i\) is the rank of \(Y_i\) in the pooled sample:
\[ S = \left|\frac{\sum_{i=1}^N T_i R_i}{\sum_{i=1}^N T_i} - \frac{\sum_{i=1}^N (1-T_i) R_i}{\sum_{i=1}^N (1-T_i)} \right| \]
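A sketch of this statistic with hypothetical data (the outcome values below are made up for illustration):

```r
# Difference-in-mean-ranks statistic: compare average ranks across groups.
set.seed(1)
T <- rep(c(0, 1), each = 5)
Y <- rnorm(10) + T            # hypothetical continuous outcomes, shifted when T = 1
R <- rank(Y)                  # R_i: rank of Y_i in the pooled sample
S <- abs(mean(R[T == 1]) - mean(R[T == 0]))
S
```

Up to scaling, this is the Wilcoxon/Mann-Whitney rank-sum statistic; its null distribution can again be obtained by permuting \(T\).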
Specify a sharp null hypothesis
Choose a test statistic \(S = f(\{Y_i, T_i, \tau_{0i}\}_{i=1}^N)\)
Compute the statistic under every possible (or many randomly sampled) treatment assignments
Compare the observed statistic with this null distribution to obtain an exact \(p\)-value
The California Alphabet Lottery (Ho and Imai 2006).
Randomization sometimes occurs in the real world.
Started in 1975: “[B]oth the ‘incumbent first’ and ‘alphabetical order’ procedures are constitutionally impermissible.” (Gould v. Grubb, 14 Cal. 3d 661, 676).
A random alphabet is drawn for every statewide election that applies to all statewide offices.
Candidates are ordered by this randomized alphabet for the first of 80 assembly districts and are rotated for each subsequent assembly district.
“Each letter of the alphabet shall be written on a separate slip of paper, each of which shall be folded and inserted into a capsule. Each capsule shall be opaque and of uniform weight, color, size, shape, and texture. The capsules shall be placed in a container, which shall be shaken vigorously in order to mix the capsules thoroughly. The container then shall be opened and the capsules removed at random one at a time. As each is removed, it shall be opened and the letter on the slip of paper read aloud and written down. The resulting random order of letters constitutes the randomized alphabet, which is to be used in the same manner as the conventional alphabet in determining the order of all candidates in all elections. For example, if two candidates with the surnames Campbell and Carlson are running for the same office, their order on the ballot will depend on the order in which the letters M and R were drawn in the randomized alphabet drawing.”
Take into account the complex lottery procedure
\(\rightsquigarrow\) Impossible via model-based inference!
Ho and Imai (2006) rely on data from the 2003 CA Gubernatorial Recall Election
Setup:
Randomized alphabet (2003 recall election):
R W Q O J M V A H B S G Z X N T C I E K U P D Y F L
Practical Considerations
In most experiments, researchers focus on the ATE and use the Neyman approach to construct confidence intervals, because a sharp null hypothesis is often not of substantive interest to social scientists.
Consider the Fisherian exact test
When you have a small sample size (avoid if possible!), or
When you have a complex (but known!) treatment assignment mechanism (e.g., natural experiment).
Extensions
Common practice: Conduct balance checks with respect to observed pre-treatment covariates.
Balance checks can be automated, e.g., with `RItools::xBalance()`.
Can correct imbalance via regression, matching, weighting, etc. (more on matching and weighting later).
Covariate adjustment can also improve efficiency: it can reduce the randomization/sampling variance of our estimate of \(\tau\) while maintaining consistency.
But naive covariate adjustment has pitfalls:
Pitfall 1 — Efficiency loss: Naive adjustment can even hurt asymptotic precision relative to the unadjusted estimator (Freedman’s critique).
Pitfall 2 — Common-slope constraint: Naive adjustment forces parallel regression lines with respect to \(X\) and limits precision gains.
\[ Y_i = \alpha + \tau T_i + (X_i - \overline{X})' \gamma + T_i (X_i - \overline{X})' \delta + \varepsilon_i \]
This addresses both pitfalls:
# simulate data
set.seed(972)
N <- 100
X <- rnorm(N, mean = 3)
D <- randomizr::complete_ra(N)
Y <-
2 + 1.5 * D + 0.1 * X +
3 * D * (X - mean(X)) + rnorm(N, sd = 1.5)
# simple difference-in-means
dim_model <-
estimatr::lm_robust(Y ~ D) |> estimatr::tidy()
# naive covariate adjustment (common slopes)
unadjusted_model <-
estimatr::lm_robust(Y ~ D + X) |> estimatr::tidy()
# demean covariates
X_centered <- X - mean(X)
# Lin estimator (hand-coded)
adjusted_model <-
estimatr::lm_robust(Y ~ D * X_centered) |> estimatr::tidy()
# Lin estimator (estimatr)
adjusted_model2 <-
estimatr::lm_lin(Y ~ D, covariates = ~X) |>
estimatr::tidy()
bind_rows(
dim_model, unadjusted_model, adjusted_model, adjusted_model2
) |>
dplyr::filter(term == "D") |>
dplyr::mutate(model = c("unadjusted", "naive", "Lin (hand)", "Lin (estimatr)")) |>
dplyr::select(model, term, estimate, std.error) |>
knitr::kable(digits = 5, align = "lccc") |>
kableExtra::kable_minimal(font_size = 22)

| model | term | estimate | std.error |
|---|---|---|---|
| unadjusted | D | 1.43687 | 0.50032 |
| naive | D | 1.52321 | 0.46324 |
| Lin (hand) | D | 1.55665 | 0.30858 |
| Lin (estimatr) | D | 1.55665 | 0.30858 |
In some experiments, units may have different probabilities of being assigned to the treatment group.
This can occur due to:
Block (stratified) randomization with different probabilities within strata, or unequal-sized clusters.
Practical constraints or ethical considerations.
Adaptive designs where probabilities change over time (Offer-Westort, Coppock, and Green 2021).
Problem: Unequal probabilities can bias the Difference-in-Means (DiM) estimator.
Solution: Use the Inverse Probability Weighting (IPW) estimator that we introduced before to account for unequal probabilities.
Recall the DiM estimator: \(\widehat{\tau}_{DiM} = \frac{1}{N_1} \sum_{i=1}^N T_i Y_i - \frac{1}{N_0} \sum_{i=1}^N (1 - T_i) Y_i\)
Suppose \(p_i = {\textrm{Pr}}(T_i = 1)\) varies across units, but we have complete random assignment (\(N_1\) and \(N_0\) are fixed).
\[ \begin{align*} {\mathbb{E}}[\widehat{\tau}] &= {\mathbb{E}}\left[ \frac{1}{N_1}\sum_{i=1}^N T_i Y_i(1) \right] - {\mathbb{E}}\left[ \frac{1}{N_0}\sum_{i=1}^N (1 - T_i) Y_i(0) \right] \\ &= \sum_{i=1}^N {\mathbb{E}}\left[\frac{T_i}{N_1}\right] Y_i(1) - \sum_{i=1}^N {\mathbb{E}}\left[\frac{(1 - T_i)}{N_0}\right] Y_i(0) \\ &= \frac{\sum_{i=1}^N p_i\,Y_i(1)}{\sum_{j=1}^N p_j} - \frac{\sum_{i=1}^N (1-p_i)\,Y_i(0)}{\sum_{j=1}^N (1-p_j)} \\ &\neq \tau_{ATE}. \end{align*} \]
\[ \widehat{\tau}_{IPW} = \frac{1}{N} \sum_{i=1}^N \left( \frac{T_i Y_i}{p_i} - \frac{(1 - T_i) Y_i}{1 - p_i} \right) \]
Properties:
Easily implemented via weighted least squares (e.g., `estimatr::lm_robust()` with the `weights` argument).
Intuition: Weight each unit by the inverse of its probability of receiving the treatment condition it actually received, compensating for over- or under-representation.
Note: The IPW estimator is unbiased and consistent if the treatment assignment mechanism is known and correctly specified.
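A sketch with simulated data and known, unequal assignment probabilities (Bernoulli assignment here, for simplicity; all numbers are hypothetical): DiM is biased because treatment probability is correlated with the potential outcomes, while IPW recovers the ATE.

```r
# Units with higher Y_i(0) are more likely to be treated; true ATE = 1.
set.seed(42)
N  <- 100000
Y0 <- rnorm(N)
Y1 <- Y0 + 1
p  <- ifelse(Y0 > 0, 0.8, 0.2)   # known, unequal assignment probabilities
T  <- rbinom(N, 1, p)
Y  <- T * Y1 + (1 - T) * Y0
dim_est <- mean(Y[T == 1]) - mean(Y[T == 0])        # biased upward
ipw_est <- mean(T * Y / p - (1 - T) * Y / (1 - p))  # unbiased given known p_i
c(dim = dim_est, ipw = ipw_est)
```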
So far, we have assumed treatments are assigned at the individual level.
Sometimes random assignment occurs at the cluster level for various reasons:
Treatment only makes sense at the group level, but the outcome is measured for individuals.
Treatment too costly to implement individually.
SUTVA only plausible if treatment is defined at the group level.
Example: Effect of classroom teaching method on student performance.
Standard errors ignoring cluster randomization are usually too small (opposite of conservative).
This is due to units within the same cluster typically being more similar than units in different clusters.
Warning
“Analyses of group randomized trials that ignore clustering are an exercise in self-deception.” (Cornfield 1978)
When clusters have unequal sizes, the DiM estimator can produce biased estimates of \(\tau_{ATE}\) if the cluster sizes are correlated with the potential outcomes.
Intuition: The DiM estimator gives equal weight to each cluster, regardless of size, which can lead to biased estimates if cluster sizes are not balanced.
To address the issue, as before, we can:
Recall the Law of Total Variance: \({\mathbb{V}}(Y) = {\mathbb{E}}[{\mathbb{V}}(Y\mid X)] + {\mathbb{V}}({\mathbb{E}}[Y\mid X])\)
This implies the decomposition of heterogeneity in outcomes:
\[ \underbrace{\sum_{j=1}^G \sum_{i=1}^{N_j} (Y_{ij} - \overline{Y})^2}_{\text{overall variance, } \sigma^2} = \underbrace{\sum_{j=1}^G \sum_{i=1}^{N_j} (Y_{ij} - \overline{Y}_{j})^2}_{\text{within-cluster variance, } \sigma^2_W} + \underbrace{\sum_{j=1}^G N_{j}(\overline{Y}_{j} - \overline{Y})^2,}_{\text{between-cluster variance, } \sigma^2_B} \]
where \(\overline{Y}_{j}\) is the mean of \(Y_{ij}\) in cluster \(j\) and \(\overline{Y}\) is the mean of all \(Y_{ij}\).
Then we can define the intracluster correlation: \(\rho = \frac{\sigma^2_B}{\sigma^2} = 1 - \frac{\sigma^2_W}{\sigma^2}\)
Intuition: When \(\rho\) is \(1\) (\(0\)), responses are identical (uncorrelated) within each cluster.
Design effect (variance inflation from clustering):
\[ \frac{{\mathbb{V}}(\hat\tau_{CL})}{{\mathbb{V}}(\hat\tau_{R})} = 1 + (\overline{N} - 1)\rho, \quad \text{where} \quad \overline{N} = \frac{1}{G} \sum_{j=1}^G N_j \]
Valid inference:
Note: When \(G\) is small, \(\rho\) will be poorly estimated and cluster SEs will be unreliable \(\rightsquigarrow\) prefer increasing \(G\) over sample size per cluster (\(N_j\)).
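A base-R Monte Carlo sketch of the problem (simulated clusters, hypothetical parameters; in practice one would use, e.g., `estimatr::lm_robust()` with its `clusters` argument):

```r
# Cluster-level assignment with positive ICC: the naive i.i.d. SE
# badly understates the true sampling variability of the DiM.
set.seed(7)
G <- 20; m <- 30                        # 20 clusters of 30 units each
one_draw <- function() {
  u  <- rnorm(G)                        # cluster effects => rho > 0
  Tg <- sample(rep(c(0, 1), G / 2))     # randomize at the cluster level
  T  <- rep(Tg, each = m)
  Y  <- rep(u, each = m) + rnorm(G * m) # no treatment effect (true tau = 0)
  tau_hat <- mean(Y[T == 1]) - mean(Y[T == 0])
  se_iid  <- sqrt(var(Y[T == 1]) / sum(T) + var(Y[T == 0]) / sum(1 - T))
  c(tau_hat, se_iid)
}
draws <- replicate(2000, one_draw())
c(true_sd = sd(draws[1, ]), avg_naive_se = mean(draws[2, ]))
```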
Ignoring clustering in standard error estimates can result in overly “optimistic” confidence intervals and inflated Type I error (false-positive) rates.
The simulation shows how coverage (the probability that 95% CIs include the true ATE across replications of the experiment) depends on the ratio of between- to within-cluster variance:
Wantchekon (2003)
We discussed pros/cons for covariate adjustment after randomization
But, why not do adjustment before randomization?
Basic idea: If you have data on pre-treatment characteristics \(X_i\), why leave their balance to pure chance?
Example: \(n = 4\) with two males and two females.
Complete randomization will place both females in the same treatment group \(\frac{1}{3}\) of the time.
If that happens, how can we tell the treatment effect from gender difference?
Solution: Pre-stratify the sample, and then randomize completely within each stratum
Note
\(\rightsquigarrow\) “Block what you can; randomize what you cannot.” (George Box)
In the GOTV experiment, what if we have previous turnout data from the voter file?
\[ \tau_\text{v} = \frac{1}{N_\text{v}} \sum_{i:V_i=1} [ Y_i(1) - Y_i(0) ], \qquad \tau_\text{nv} = \frac{1}{N_\text{nv}} \sum_{i:V_i=0} [ Y_i(1) - Y_i(0) ] \]
\[ \tau_{SATE} = \underbrace{\left( \frac{N_\text{v}}{N_\text{v} + N_\text{nv}} \right)}_{\text{share voters}} \tau_\text{v} + \underbrace{\left( \frac{N_\text{nv}}{N_\text{v} + N_\text{nv}} \right)}_{\text{share non-voters}} \tau_\text{nv} \]
Block (stratified) randomized experiment:
Probability of treatment in each group called the propensity score:
Blocking ensures balance across blocks:
\[ \begin{align*} \widehat{\tau}_\text{v} &= \overline{Y}_{1,\text{v}} - \overline{Y}_{0,\text{v}} = \frac{1}{N_{1,\text{v}}} \sum_{i:V_i=1} T_i Y_i - \frac{1}{N_{0,\text{v}}} \sum_{i:V_i=1} (1 - T_i) Y_i \\ \widehat{\tau}_\text{nv} &= \overline{Y}_{1,\text{nv}} - \overline{Y}_{0,\text{nv}} = \frac{1}{N_{1,\text{nv}}} \sum_{i:V_i=0} T_i Y_i - \frac{1}{N_{0,\text{nv}}} \sum_{i:V_i=0} (1 - T_i) Y_i \end{align*} \]
\[ \widehat{\tau}_{BR} = \left(\frac{N_\text{v}}{N}\right) \widehat{\tau}_\text{v} + \left(\frac{N_\text{nv}}{N}\right) \widehat{\tau}_\text{nv} \]
\[ {\mathbb{V}}[\widehat{\tau}_\text{v} {\:\vert\:}\mathcal{O}] = \frac{S^2_{1,\text{v}}}{N_{1,\text{v}}} + \frac{S^2_{0,\text{v}}}{N_{0,\text{v}}} - \frac{S^2_{\tau,\text{v}}}{N_{\text{v}}}, \]
where \(S^2_{t,\text{v}}\) is the within-block sample variance of the \(Y_i(t)\) potential outcomes, and \(S^2_{\tau,\text{v}}\) that of \(\tau_i\).
\[ {\mathbb{V}}[\widehat{\tau}_{BR} {\:\vert\:}\mathcal{O}] = \left(\frac{N_\text{v}}{N}\right)^2 {\mathbb{V}}[\widehat{\tau}_\text{v} {\:\vert\:}\mathcal{O}] + \left(\frac{N_\text{nv}}{N}\right)^2 {\mathbb{V}}[\widehat{\tau}_\text{nv} {\:\vert\:}\mathcal{O}] \]
\[ \widehat{\sigma}_{BR} = \left(\frac{N_\text{v}}{N}\right)^2 \left(\frac{\widehat{\sigma}^2_{1,\text{v}}}{N_{1,\text{v}}} + \frac{\widehat{\sigma}^2_{0,\text{v}}}{N_{0,\text{v}}}\right) + \left(\frac{N_\text{nv}}{N}\right)^2 \left(\frac{\widehat{\sigma}^2_{1,\text{nv}}}{N_{1,\text{nv}}} + \frac{\widehat{\sigma}^2_{0,\text{nv}}}{N_{0,\text{nv}}}\right), \]
where \(\widehat{\sigma}^2_{t,\text{v}}\) are the within-strata observed outcome variances.
Blocks, \(j \in \{1, \dots, J\}\).
\[ \begin{align*} \widehat{\tau}_j &= \frac{1}{N_{1,j}} \sum_{i:B_i=j} T_i Y_i - \frac{1}{N_{0,j}} \sum_{i:B_i=j} (1 - T_i) Y_i, \\ \widehat{\sigma}_j &= \frac{\widehat{\sigma}^2_{1,j}}{N_{1,j}} + \frac{\widehat{\sigma}^2_{0,j}}{N_{0,j}} \end{align*} \]
\[ \widehat{\tau}_{BR} = \sum_{j} w_j \widehat{\tau}_j, \qquad \widehat{\sigma}_{BR} = \sum_{j} w_j^2 \widehat{\sigma}_j, \qquad \text{where } w_j = N_j / N \]
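A hand-coded sketch of the block estimator on simulated data (hypothetical blocks and parameters; the weights are block shares \(N_j/N\)):

```r
# Block estimator: within-block DiM, then weight by block share. True ATE = 0.5.
set.seed(99)
Nj    <- c(400, 600, 800, 200)                  # block sizes
block <- rep(seq_along(Nj), times = Nj)
T     <- unlist(lapply(Nj, function(n) sample(rep(c(0, 1), n / 2))))
Y     <- 0.5 * T + block + rnorm(sum(Nj))       # block-specific baseline levels
tau_j <- sapply(seq_along(Nj), function(j) {    # within-block difference-in-means
  yj <- Y[block == j]; tj <- T[block == j]
  mean(yj[tj == 1]) - mean(yj[tj == 0])
})
w_j    <- Nj / sum(Nj)                          # weights N_j / N
tau_br <- sum(w_j * tau_j)
tau_br  # close to 0.5
```

In practice the same estimate can be obtained with, e.g., `estimatr::difference_in_means()` and its `blocks` argument.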
Efficiency of block versus complete randomization (R) depends on the sampling scheme.
\[ \begin{align*} B &= \sum_{j=1}^{J} \left(\frac{N_j}{N}\right) (\overline{Y}_j(1) + \overline{Y}_j(0) - (\overline{Y}(1) + \overline{Y}(0)))^2 \\ W &= \sum_{j=1}^{J} \frac{N_j}{N} \frac{N_{1,j} N_{0,j}}{N_j} \widehat{\sigma}(\widehat{\tau}_j {\:\vert\:}\mathcal{O}) \end{align*} \]
Difference can be positive or negative (Pashley and Miratrix 2022):
Discrete covariates \(\rightsquigarrow\) blocks by unique combinations.
Alternative: create blocks by forming homogeneous groups in \(\mathbf{X}\).
\[ M(\mathbf{X}_i, \mathbf{X}_k) = \sqrt{(\mathbf{X}_i - \mathbf{X}_k)' \hat{V}(\mathbf{X})^{-1} (\mathbf{X}_i - \mathbf{X}_k)} \]
Challenges:
The `optmatch` and `blockTools` packages allow us to perform matching.

blockTools

pacman::p_load(
blockTools, randomizr, RItools
)
set.seed(20250211)
# simulate some data
N <- 100
data <- data.frame(
id = 1:N,
female = sample(c(0, 1), N, replace = TRUE),
age = round(truncnorm::rtruncnorm(N, a = 18, b = 80, mean = 30, sd = 10)),
education = sample(1:4, N, replace = TRUE) # 1: High School, 2: College, etc.
)
# form blocks using gender and age
blocks <- block(
data,
id.vars = "id",
groups = "female",
n.tr = 2,
block.vars = c("age"),
distance = "mahalanobis"
)
# add block ids and random assignment
data <-
data |>
dplyr::mutate(
block_id =
blockTools::createBlockIDs(
obj = blocks, data = data, id.var = "id"),
treat1 =
randomizr::complete_ra(N = n()),
treat2 =
randomizr::block_ra(
blocks = female,
prob = 0.5),
treat3 =
randomizr::block_ra(
blocks = block_id,
prob = 0.5)
)
# output balance tests
out <-
lapply(
1:3,
function(x) {
RItools::xBalance(
fmla = as.formula(paste0("treat", x, "~ education + female + age")),
data = data,
report = c("adj.means", "std.diffs", "p.values"))$results[,,1] |>
knitr::kable(
digits = 3, align = "cccc",
caption = c("Complete RA", "Block by female", "Block by female and age")[x]) |>
kableExtra::kable_minimal(font_size = 20)
}
)
out[[1]]; out[[2]]; out[[3]]

| | Control | Treatment | std.diff | p |
|---|---|---|---|---|
| education | 2.58 | 2.22 | -0.327 | 0.105 |
| female | 0.42 | 0.64 | 0.447 | 0.028 |
| age | 32.84 | 32.16 | -0.086 | 0.664 |
| | Control | Treatment | std.diff | p |
|---|---|---|---|---|
| education | 2.569 | 2.224 | -0.312 | 0.121 |
| female | 0.529 | 0.531 | 0.002 | 0.990 |
| age | 31.039 | 34.020 | 0.386 | 0.057 |
| | Control | Treatment | std.diff | p |
|---|---|---|---|---|
| education | 2.54 | 2.26 | -0.253 | 0.207 |
| female | 0.52 | 0.54 | 0.040 | 0.842 |
| age | 32.88 | 32.12 | -0.097 | 0.627 |
\[ Y_i = \alpha + \tau T_i + \sum_{j=2}^{M} \beta_j B_{ij} + \epsilon_i, \quad \text{where} \quad {\mathbb{E}}[\widehat{\tau}_{OLS}] = \tau \]
\[ \forall i, j:\: w_{ij} = \begin{cases} \frac{1}{p_j}, & \text{if } T_i = 1 \\ \frac{1}{(1 - p_j)}, & \text{if } T_i = 0 \end{cases} \]
Recall that for a statistical test:
Statistical power of a design + test \(\equiv\) the test’s probability of rejecting the null in favor of the alternative when the alternative is indeed true, given the design.
Example: \(H_0:\: \beta = 0\) and \(H_a:\: |\beta| = \tau > 0\), giving rise to a two-sided test. Then power is given by \(\kappa = {\textrm{Pr}}[\text{Reject } H_0 {\:\vert\:}H_a \text{ is true; design, test}]\).
What does power depend on?
Consider a randomized experiment with complete randomization.
Problem: For a given total sample size \(N\), choose the optimal treatment allocation \(p = N_1/N\) to minimize the variance of the estimator of the average treatment effect.
\[ {\mathbb{V}}(\hat\tau) = \frac{\sigma^2_1}{p N}+\frac{\sigma^2_0}{(1-p) N} \]
\[ -\frac{\sigma^2_1}{p^{*2} N}+\frac{\sigma^2_0}{(1-p^*)^2 N}=0 \]
Therefore:
\[ \frac{1-p^*}{p^*} = \frac{\sigma_0}{\sigma_1} \implies p^* = \frac{\sigma_1}{\sigma_1+\sigma_0}=\frac{1}{1+\sigma_0/\sigma_1} \]
Intuition: As a rule of thumb, if you can assume \(\sigma_1\approx \sigma_0\), choose \(p^{*}=0.5\) (equal allocation).
For practical reasons it is sometimes better to choose unequal sample sizes (even if \(\sigma_1\approx \sigma_0\)).
# function to calculate variance of DiM
variance_dim <-
function(p, N, sigma1 = 1, sigma0 = 1) {
(sigma1^2 / (p * N)) + (sigma0^2 / ((1 - p) * N))
}
# create a sequence of assignment p's
# calculate variance for each p
variance_data <-
tibble(
N = 100,
p = seq(0.01, 0.99, by = 0.01),
variance =
map2_dbl(
N, p,
\(x, y) variance_dim(p = y, N = x)))
# plot using ggplot2
ggplot(variance_data, aes(x = p, y = variance)) +
geom_line(color = "#458588", linewidth = 1) +
labs(
x = bquote("Proportion of Treatment Group, " ~ N[1]/N),
y = bquote("Variance of DiM, " ~ V(bar(Y)(1) - bar(Y)(0)))
) +
theme(text = element_text(size = 16))

Suppose that \(\sigma^2_0=\sigma^2_1\) and \(Y_i (0) \sim (\mu_0, \sigma^2)\) and \(Y_i (1) \sim (\mu_1, \sigma^2)\)
Assume also that \(p=0.5\), so \(N_0=N_1=N/2\), and \(\tau=\mu_1-\mu_0\).
\[ \frac{\widehat{\tau}_{DiM}-\tau}{\sqrt{\frac{\sigma^2_1}{N_1}+\frac{\sigma^2_0}{N_0}}} = \frac{\widehat{\tau}_{DiM}-\tau}{\sqrt{\frac{2\sigma^2}{N}+\frac{2\sigma^2}{N}}} = \frac{\widehat{\tau}_{DiM}-\tau}{2\sigma/\sqrt{N}} \sim \mathcal{N} (0,1). \]
\[ t = \frac{\widehat{\tau}_{DiM}}{\sqrt{\frac{\sigma^2_1}{N_1}+\frac{\sigma^2_0}{N_0}}} \sim \mathcal{N}\left(\frac{\tau\sqrt{N}}{2\sigma},1\right) \]
\[ \begin{align*} {\textrm{Pr}}\left(|t| > 1.96\right) &= {\textrm{Pr}}\left(t < -1.96\right) + {\textrm{Pr}}\left(t > 1.96\right) \\ &= {\textrm{Pr}}\left(t-\frac{\tau \sqrt N}{2\sigma} < -1.96 - \frac{\tau \sqrt N}{2\sigma}\right) \\ &\qquad + {\textrm{Pr}}\left(t-\frac{\tau \sqrt N}{2\sigma}>1.96-\frac{\tau \sqrt N}{2\sigma}\right) \\ &= \Phi\left(-1.96-\frac{\tau \sqrt N}{2\sigma}\right) + \left(1-\Phi\left(1.96-\frac{\tau \sqrt N}{2\sigma}\right)\right) \end{align*} \]
\[ \begin{align*} {\textrm{Pr}}(\text{reject } \mu_1-\mu_0=0 &| \mu_1-\mu_0=\tau) = \\ & \Phi\left(-1.96-\tau\Bigg/\sqrt{\frac{\sigma_1^2}{p N}+\frac{\sigma_0^2}{(1-p)N}}\right) \\ & \qquad + \left(1-\Phi\left(1.96-\tau\Bigg/\sqrt{\frac{\sigma_1^2}{p N}+\frac{\sigma_0^2}{(1-p)N}}\right)\right) \end{align*} \]
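The closed-form power expression above translates directly into base R. This is a sketch; `power_dim` is a hypothetical helper name and the defaults are illustrative:

```r
# two-sided power of the difference-in-means test, normal approximation
power_dim <- function(tau, N, p = 0.5, sigma1 = 1, sigma0 = 1, alpha = 0.05) {
  se <- sqrt(sigma1^2 / (p * N) + sigma0^2 / ((1 - p) * N))
  z  <- qnorm(1 - alpha / 2)
  # Phi(-z - tau/se) + (1 - Phi(z - tau/se)), as in the derivation above
  pnorm(-z - tau / se) + (1 - pnorm(z - tau / se))
}

power_dim(tau = 0.5, N = 128)  # roughly 0.80 for a half-SD effect
```

Note that at \(\tau = 0\) the function returns \(\alpha\), the size of the test, as it should.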
To choose \(N\) we need to specify:
Power calculations for sample size assuming
Testing \(H_0 : \beta = 0\) against \(H_a : |\beta| = b_{H_a} > 0\).
Large-sample distribution of \(\hat{\beta}\) under \(H_0\), denoted by \(F_{\hat{\beta}}\bigl(\cdot \mid \beta = 0\bigr)\).
For a test at \(100\bigl(1 - \alpha\bigr)\%\) confidence, the distribution under \(H_0\) defines the “rejection region.”
Large-sample distribution of \(\hat{\beta}\) under \(H_a\), denoted by \(F_{\hat{\beta}}\bigl(\cdot {\:\vert\:}\beta = b_{H_a}\bigr)\).
Power is the probability of falling in the “rejection region”:
\[ \kappa = 1 - F_{\hat{\beta}}\bigl(t_{\alpha/2} \sigma_{\hat{\beta}} \mid \beta = b_{H_a}\bigr) \]
For a test with power \(\kappa\), need \(b_{H_a}\) as in the picture.
\[ \left| \beta_{\alpha, \kappa, \sigma_{\hat{\beta}}^{(0)}, \sigma_{\hat{\beta}}^{(a)}} \right| = t_{\alpha/2} \sigma_{\hat{\beta}}^{(0)} + t_{1-\kappa} \sigma_{\hat{\beta}}^{(a)} \]
where,
Let \(\sigma_{\hat{\beta}} = \max\bigl(\sigma_{\hat{\beta}}^{(0)}, \sigma_{\hat{\beta}}^{(a)}\bigr)\).
Then define a conservative MDE as
RESULT: Minimum Detectable Effect
\[ \underbrace{\left| \beta_{\alpha, \kappa, \sigma_{\hat{\beta}}} \right|}_{MDE} = \bigl(t_{\alpha/2} + t_{1-\kappa} \bigr) \sigma_{\hat{\beta}}. \]
Suppose we test against \(\mathcal{N} (0,1)\) with \(\alpha = 0.05\) and \(\kappa = 0.80\), the conventional standards in social science. Then we have \(t_{\alpha/2} + t_{1-\kappa} = |z_{.025}| + |z_{.2}| = 1.96 + 0.84 = 2.80\).
Therefore, the MDE is 2.8 times the conservative standard error of the effect estimator for any study that uses these conventional levels.
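The 2.80 multiplier is easy to verify with base R's Normal quantile function:

```r
# MDE multiplier for alpha = 0.05 (two-sided) and kappa = 0.80
qnorm(1 - 0.05 / 2) + qnorm(0.80)  # 1.96 + 0.84 = 2.80 (to two decimals)
```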
In all applications we have studied, the normal or \(t\) distribution is a good approximation so long as the sample size is not too small.
\[ {\mathbb{V}}[\hat{\beta}] = \frac{1}{N} \left( \frac{\sigma^2_{1}}{p} + \frac{\sigma^2_{0}}{1 - p} \right) \]
\[ \begin{align*} MDE &= \bigl(t_{\alpha/2} + t_{1-\kappa} \bigr) \sqrt{\frac{1}{N} \left( \frac{\sigma^2_{1}}{p} + \frac{\sigma^2_{0}}{1 - p} \right)} \\ \implies N &= \frac{\bigl(t_{\alpha/2} + t_{1-\kappa} \bigr)^2 \left( \frac{\sigma^2_{1}}{p} + \frac{\sigma^2_{0}}{1 - p} \right)}{MDE^2}. \end{align*} \]
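Inverting the MDE formula for \(N\), as above, gives a one-line sample-size calculator. A minimal sketch, assuming the normal approximation; `sample_size` is a hypothetical helper name:

```r
# required N to detect a given MDE with power kappa at level alpha
sample_size <- function(MDE, p = 0.5, sigma1 = 1, sigma0 = 1,
                        alpha = 0.05, kappa = 0.80) {
  mult <- qnorm(1 - alpha / 2) + qnorm(kappa)  # the 2.80 multiplier
  ceiling(mult^2 * (sigma1^2 / p + sigma0^2 / (1 - p)) / MDE^2)
}

sample_size(MDE = 0.5)  # 126 units for a half-SD standardized effect
```

Halving the MDE roughly quadruples the required \(N\), since \(N \propto 1/MDE^2\).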
Note: Use standardized effect sizes to avoid the “power fallacy”!
Statistical power is the probability of rejecting the null hypothesis when it is indeed false.
Optimal design considerations:
Practical considerations:
Further reading: Blair, Coppock, and Humphreys (2023, Ch. 10-11).
In OLS, adding a regressor that is orthogonal to the existing regressors does not change their coefficients; it only reduces the residual variance \(\Rightarrow\) smaller SEs.
In a randomized experiment, \(T_i {\mbox{$\perp\!\!\!\perp$}}X_i\) by design, so \({\mathrm{corr}}(T_i, X_i) \approx 0\). Great!
But for the interaction \(T_i X_i\) (without demeaning):
\[ \begin{align*} {\mathrm{cov}}(T_i, T_i X_i) &= {\mathbb{E}}[T_i^2 X_i] - {\mathbb{E}}[T_i] {\mathbb{E}}[T_i X_i] \\ &= \bigl({\mathbb{E}}[T_i^2] - {\mathbb{E}}[T_i]^2\bigr) {\mathbb{E}}[X_i] \neq 0 \;\text{ if } {\mathbb{E}}[X_i] \neq 0 \end{align*} \]
So \(T_i\) and \(T_i X_i\) are correlated \(\Rightarrow\) \(\widehat{\tau}\) estimates the effect at \(X = 0\), not the ATE.
With demeaning: \({\mathbb{E}}[\tilde{X}_i] = 0 \implies {\mathrm{cov}}(T_i, T_i \tilde{X}_i) = 0\) by construction.
Orthogonality restored \(\Rightarrow\) \(\widehat{\tau}_{Lin}\) is shielded from the interaction terms.
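A small simulation (illustrative data, not from any cited study) shows the difference: with a raw interaction the coefficient on \(T_i\) is the effect at \(X = 0\); after demeaning it recovers the ATE:

```r
# Lin-style regression adjustment: raw vs. demeaned interaction
set.seed(8357)
N <- 10000
X <- rnorm(N, mean = 5)                 # covariate with E[X] != 0
T <- rbinom(N, 1, 0.5)                  # randomized treatment
# DGP: ATE = 2; the effect at X = 0 is 2 + 0.5 * (0 - 5) = -0.5
Y <- 1 + 2 * T + X + 0.5 * T * (X - 5) + rnorm(N)

coef(lm(Y ~ T * X))["T"]                # approx. -0.5: effect at X = 0
Xc <- X - mean(X)                       # demean before interacting
coef(lm(Y ~ T * Xc))["T"]               # approx. 2: the ATE
```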
\[ \begin{aligned} &{\mathbb{E}}[ \widehat{\tau}_{IPW} {\:\vert\:}\mathcal{O}_N ] \\ &= {\mathbb{E}}\left[ \frac{1}{N} \sum_{i=1}^N \left\{\frac{T_iY_i}{p} - \frac{(1 - T_i) Y_i}{(1 - p)}\right\} \Bigg| \mathcal{O}_N \right] \\ &= \frac{1}{N} \sum_{i=1}^N \left\{ {\mathbb{E}}\left[ \frac{T_i Y_i (1)}{p} {\:\vert\:}\mathcal{O}_N \right] - {\mathbb{E}}\left[ \frac{(1 - T_i) Y_i (0)}{(1 - p)} {\:\vert\:}\mathcal{O}_N \right] \right\} \quad \text{($\because$ distribute ${\mathbb{E}}$/random assignment)}\\ &= \frac{1}{N} \sum_{i=1}^N \left\{ \frac{Y_i(1)}{p} {\mathbb{E}}[ T_i {\:\vert\:}\mathcal{O}_N ] - \frac{Y_i(0)}{1 - p} {\mathbb{E}}[ 1 - T_i {\:\vert\:}\mathcal{O}_N ] \right\} \quad \text{($\because$ POs are fixed)}\\ &= \frac{1}{N} \sum_{i=1}^N \left\{ \frac{Y_i (1)}{p} \cdot p - \frac{Y_i (0)}{1 - p} \cdot (1 - p) \right\} \quad \text{($\because$ definition of ${\mathbb{E}}$)}\\ &= \frac{1}{N} \sum_{i=1}^N Y_i(1) - Y_i(0) = \tau_{SATE} \end{aligned} \]
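Under complete randomization with \(N_1 = pN\) exactly, the IPW estimator coincides with the difference in means. A short check on simulated (made-up) data:

```r
# IPW estimator vs. difference in means under complete randomization
set.seed(1)
N <- 100; N1 <- 50
T <- sample(rep(c(1, 0), times = c(N1, N - N1)))  # exactly N1 treated
Y <- 2 * T + rnorm(N)
p <- N1 / N

ipw     <- mean(T * Y / p - (1 - T) * Y / (1 - p))
dim_est <- mean(Y[T == 1]) - mean(Y[T == 0])
all.equal(ipw, dim_est)  # TRUE: identical when p = N1/N
```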
\[ \begin{aligned} {\mathrm{cov}}(\overline{Y}_1, \overline{Y}_0 {\:\vert\:}\mathcal{O}_N) &= \frac{1}{N_1 N_0}\sum_i\sum_j Y_i(1)\,Y_j(0)\;{\mathrm{cov}}(T_i,\;1-T_j) = -\frac{1}{N_1 N_0}\sum_i\sum_j Y_i(1)\,Y_j(0)\;{\mathrm{cov}}(T_i, T_j) \\[6pt] &\text{Under CR: } {\mathbb{V}}(T_i) = \frac{N_1 N_0}{N^2}, \quad {\mathrm{cov}}(T_i, T_j)\big|_{i \neq j} = -\frac{N_1 N_0}{N^2(N-1)} \\[6pt] &= \underbrace{-\frac{1}{N^2}\sum_i Y_i(1)Y_i(0)}_{\text{diagonal } (i=j)} \;+\; \underbrace{\frac{1}{N^2(N-1)}\sum_{i \neq j} Y_i(1)Y_j(0)}_{\text{off-diagonal}} \\[6pt] &= -\frac{1}{N^2}\sum_i Y_i(1)Y_i(0) + \frac{1}{N^2(N-1)}\left[N^2\,\overline{Y(1)}\,\overline{Y(0)} - \sum_i Y_i(1)Y_i(0)\right] \\[6pt] &= -\frac{\sum_i Y_i(1)Y_i(0)}{N^2}\cdot\frac{N}{N-1} + \frac{\overline{Y(1)}\,\overline{Y(0)}}{N-1} = -\frac{1}{N}\underbrace{\frac{1}{N-1}\left[\sum_i Y_i(1)Y_i(0) - N\,\overline{Y(1)}\,\overline{Y(0)}\right]}_{= S_{10}} = -\frac{S_{10}}{N} \end{aligned} \]
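The \(-S_{10}/N\) result can be verified exactly by enumerating every complete-randomization assignment for a tiny population (made-up potential outcomes):

```r
# exact check: cov(Ybar_1, Ybar_0) = -S_10 / N under complete randomization
Y1 <- c(1, 3, 2, 5); Y0 <- c(0, 1, 1, 2)  # fixed potential outcomes, N = 4
N <- 4; N1 <- 2
assigns <- combn(N, N1)                   # all 6 equally likely assignments
means <- apply(assigns, 2, function(treated) {
  c(mean(Y1[treated]), mean(Y0[-treated]))
})
cov_exact <- mean(means[1, ] * means[2, ]) -
  mean(means[1, ]) * mean(means[2, ])
S10 <- sum((Y1 - mean(Y1)) * (Y0 - mean(Y0))) / (N - 1)
all.equal(cov_exact, -S10 / N)  # TRUE
```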
\[ {\mathbb{V}}[\hat \tau ] = {\mathbb{V}}\left [\sum_i T_i \varepsilon_i \right ] / \left ( \sum_i T_i^2 \right )^2 \]
If we assume:
\[ {\mathbb{V}}_{\textrm{CR}}[\hat \tau] = \left (\sum_i \sum_j \textcolor{blue}{T_i T_j} \textcolor{red}{{\mathrm{cov}}[\varepsilon_i, \varepsilon_j]} \mathbb{1}[i,j \textrm{ in the same cluster}] \right) / \left(\sum_i T_i^2 \right )^2 \]
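To see why clustering matters, the double sum above can be computed by hand for a simple cluster-randomized design. A sketch on simulated data (no small-sample correction, i.e., the CR0 variant), building the “meat” matrix cluster by cluster:

```r
# hand-rolled cluster-robust (CR0) variance for a cluster-randomized design
set.seed(2)
G <- 40; n_g <- 10
cl <- rep(1:G, each = n_g)
T  <- rep(rbinom(G, 1, 0.5), each = n_g)  # treatment assigned at cluster level
Y  <- 1 + 0.5 * T + rep(rnorm(G), each = n_g) + rnorm(G * n_g)  # cluster shocks

fit <- lm(Y ~ T)
X <- model.matrix(fit); e <- resid(fit)
bread <- solve(crossprod(X))              # (X'X)^{-1}
meat <- Reduce(`+`, lapply(split(seq_along(e), cl), function(idx) {
  s <- crossprod(X[idx, , drop = FALSE], e[idx])  # X_g' e_g
  s %*% t(s)                                      # within-cluster covariance block
}))
V_cl <- bread %*% meat %*% bread

sqrt(V_cl["T", "T"])        # clustered SE
sqrt(vcov(fit)["T", "T"])   # iid SE, which understates the uncertainty here
```

With treatment assigned at the cluster level and a common cluster shock, the clustered SE is much larger than the iid SE, as the design-effect logic predicts.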
Whenever possible, a randomized experiment is the best approach for estimating causal effects.
Identification is justified by design + Estimation & Inference are simple
High Internal Validity: we can estimate the SATE without bias, without making strong modeling assumptions
Common concern: External Validity
Egami and Hartman (2023) propose a framework for systematic sources of external validity: